Building Your First Linear Regression Model

Imagine that you’re an environmental scientist interested in predicting daily levels of atmospheric ozone pollution in Los Angeles. Ozone is an allotrope of oxygen: a molecule with three oxygen atoms instead of two. While ozone in the stratosphere protects us from the sun’s UV rays, products from burning fossil fuels can be converted into ozone at ground level, where it is toxic. Our job is to build a regression model that can predict ozone pollution levels based on the time of year & meteorological readings, such as humidity & temperature.

Let’s load the data & explore it.

data(Ozone, package = 'mlbench')
ozoneTib <- as_tibble(Ozone)
names(ozoneTib) <- c('Month', 'Date', 'Day', 'Ozone', 'Press_height', 'Wind', 'Humid', 'Temp_Sand', 'Temp_Monte', 'Inv_height', 'Press_grad', 'Inv_temp', 'Visib')
ozoneTib
## # A tibble: 366 × 13
##    Month Date  Day   Ozone Press_height  Wind Humid Temp_Sand Temp_Monte
##    <fct> <fct> <fct> <dbl>        <dbl> <dbl> <dbl>     <dbl>      <dbl>
##  1 1     1     4         3         5480     8    20        NA       NA  
##  2 1     2     5         3         5660     6    NA        38       NA  
##  3 1     3     6         3         5710     4    28        40       NA  
##  4 1     4     7         5         5700     3    37        45       NA  
##  5 1     5     1         5         5760     3    51        54       45.3
##  6 1     6     2         6         5720     4    69        35       49.6
##  7 1     7     3         4         5790     6    19        45       46.4
##  8 1     8     4         4         5790     3    25        55       52.7
##  9 1     9     5         6         5700     3    73        41       48.0
## 10 1     10    6         7         5700     3    59        44       NA  
## # … with 356 more rows, and 4 more variables: Inv_height <dbl>,
## #   Press_grad <dbl>, Inv_temp <dbl>, Visib <dbl>

We have a tibble containing 366 cases & 13 variables of daily meteorological & ozone readings. At present, the Month, Date, & Day variables are factors. Arguably this makes sense, but we’re going to treat them as numerics for this exercise. To do this, we use the handy mutate_all() function, which takes the data as the first argument & a transformation/function as the second argument. Here, we use as.numeric() to convert all the variables into the numeric class.

Next, we have some missing data in the data set. Missing data is okay in our predictor variables (we’ll deal with this later using imputation), but missing data for the variable we’re trying to predict is not okay. Therefore, we remove the cases without any ozone measurement by piping the result of the mutate_all() call into the filter() function, where we remove cases with an NA value for Ozone.

ozoneClean <- mutate_all(ozoneTib, as.numeric) %>%
  filter(!is.na(Ozone))
ozoneClean
## # A tibble: 361 × 13
##    Month  Date   Day Ozone Press_height  Wind Humid Temp_Sand Temp_Monte
##    <dbl> <dbl> <dbl> <dbl>        <dbl> <dbl> <dbl>     <dbl>      <dbl>
##  1     1     1     4     3         5480     8    20        NA       NA  
##  2     1     2     5     3         5660     6    NA        38       NA  
##  3     1     3     6     3         5710     4    28        40       NA  
##  4     1     4     7     5         5700     3    37        45       NA  
##  5     1     5     1     5         5760     3    51        54       45.3
##  6     1     6     2     6         5720     4    69        35       49.6
##  7     1     7     3     4         5790     6    19        45       46.4
##  8     1     8     4     4         5790     3    25        55       52.7
##  9     1     9     5     6         5700     3    73        41       48.0
## 10     1    10     6     7         5700     3    59        44       NA  
## # … with 351 more rows, and 4 more variables: Inv_height <dbl>,
## #   Press_grad <dbl>, Inv_temp <dbl>, Visib <dbl>

Note: Could we have imputed missing data in our target variable? Yes, we could, but doing so has the potential to introduce bias into our model, because we would be training a model to predict values that were themselves generated by a model.

Let’s plot each of our predictor variables against Ozone to get an idea of the relationships in the data. We start with our usual trick of gathering the variables with the gather() function so we can plot them on separate facets.

ozoneUntidy <- gather(ozoneClean, key = 'Variable', value = 'Value', -Ozone)

ggplotly(
  ggplot(ozoneUntidy, aes(Value, Ozone)) +
    facet_wrap(~Variable, scales = 'free_x') +
    geom_point(size = 0.5) + geom_smooth() + 
    geom_smooth(method = 'lm', col = 'red') +
    theme_bw()
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

In our ggplot() call, we facet by Variable & allow the x-axes of the facets to vary by setting the scales argument equal to 'free_x'. Then, along with a geom_point() layer, we add two geom_smooth() layers. The first geom_smooth() is given no arguments & so uses the default settings. By default, geom_smooth() will fit a LOESS curve to the data (a curvy, local regression line) if there are fewer than 1,000 cases, or a GAM curve if there are 1,000 or more cases. Either will give us an idea of the shape of the relationships. The second geom_smooth() layer specifically asks for the lm method (linear model), which draws the linear regression line that best fits the data. Drawing both will help us identify whether any of the relationships in the data are nonlinear.

The resulting plot is shown above. Some of the predictors have a linear relationship with ozone levels, some have a nonlinear relationship, & some seem to have no relationship at all.

Imputing Missing Values

Linear regression cannot handle missing values. Therefore, to avoid having to throw away a large portion of our data set, we’ll use imputation to fill in the gaps. In previous lessons, we used mean imputation to replace missing values (NAs) with the mean of the variable. While this may work, it uses only the information within a single variable, & every imputed case takes the same value, potentially biasing the model. Instead, we can actually use machine learning algorithms to predict the value of a missing observation, using all of the other variables in the data set.

There are multiple imputation methods that come packaged with mlr. These include methods such as imputeMean(), imputeMedian(), & imputeMode() (for replacing missing values with the mean, median & mode of each variable, respectively). However, the most important method is imputeLearner(). The imputeLearner() function lets us specify a supervised machine learning algorithm to predict what the missing values would have been, based on the information held in all the other variables. For example, if we want to impute missing values of a continuous variable, the process proceeds as follows:

  1. Split the data set into cases with & without missing values for this particular variable.
  2. Decide on a regression algorithm to predict what the missing values would have been.
  3. Considering only the cases without missing values, train the algorithm to predict the values of this variable from the other variables in the data set (including the dependent variable we’re trying to predict in our final model).
  4. Considering only the cases with missing values, use the model learned in step 3 to predict the missing values based on the values of the other predictors.

We employ the same strategy when imputing categorical variables, except that we choose a classification algorithm instead of a regression one. So we end up using a supervised learning algorithm to fill in the blanks so that we can use another algorithm to train our final model.
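The four steps above can be sketched manually in base R with rpart. This is a minimal illustration on a small, hypothetical simulated data set (the variable names Humid, Temp, & Wind simply echo the ozone data, not its real values):

```r
library(rpart)

# Hypothetical stand-in data with some missing humidity readings
set.seed(1)
dat <- data.frame(Temp = rnorm(100, 60, 10), Wind = rnorm(100, 5, 2))
dat$Humid <- 0.5 * dat$Temp + rnorm(100, sd = 5)
dat$Humid[sample(100, 10)] <- NA            # introduce missing values

missing <- is.na(dat$Humid)                 # step 1: split the cases
model   <- rpart(Humid ~ Temp + Wind,       # steps 2-3: train a regression
                 data = dat[!missing, ])    #   tree on the complete cases
dat$Humid[missing] <- predict(model,        # step 4: predict the gaps
                              newdata = dat[missing, ])
sum(is.na(dat$Humid))                       # 0: no missing values remain
```

The imputeLearner() function wraps exactly this kind of train-then-predict logic for us, so we won’t need to write it by hand.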

So how do we choose an imputation algorithm? There are a few practical considerations, but as always it depends on the task, & it may pay off to try different methods & see which one gives you the best performance. We can at least initially whittle the choice down to either a classification or regression algorithm, depending on whether the variable with missing values is categorical or continuous. Next, whether we have missing values in one or multiple variables makes a difference, because in the latter case we need an algorithm that can itself handle missing values. For example, say we try to use logistic regression to impute missing values of a categorical variable. We’ll get to step 3 of our procedure & stop, because the other variables in the data (which the algorithm is trying to use to predict the categorical variable) also contain missing values. Logistic regression can’t handle that & will throw an error. If the only variable with missing values were the one we were trying to impute, this wouldn’t be a problem. Finally, there is computational budget: if the algorithm you’re using to learn the final model is already computationally expensive, a computationally expensive imputation algorithm only adds to that cost. Within these constraints, it’s often best to experiment with different imputation learners & see which one works best for the task at hand.

When doing any form of missing-value imputation, it’s extremely important to ensure that the data is either missing at random (MAR) or missing completely at random (MCAR) & not missing not at random (MNAR). If the data is MCAR, it means the likelihood of a missing value is not related to any variable in the data set. If the data is MAR, it means the likelihood of a missing value is related only to the values of the other variables in the data set. For example, someone might be less likely to fill in their salary on a form because of their age. In either of these situations, we can still build models that are not biased by the presence of missing data. But consider a situation where someone is less likely to fill in their salary on a form because their salary is low. This is an example of data that is MNAR, where the likelihood of a missing value depends on the value of the variable itself. In such a situation, you would likely build a model that is biased to overestimate the salaries of the people in your survey.

How do we tell if our data is MCAR, MAR, or MNAR? There are methods for distinguishing MCAR & MAR. For example, you could build a classification model that predicts whether a case has a missing value for a particular variable. If the model does better at predicting missing values than a random guess, then the data is likely MAR. If the model can’t do much better than a random guess, then the data is probably MCAR. Is there a way to tell whether data is MNAR? Unfortunately, no. Being confident your data is not MNAR depends on good experiment design & thoughtful examination of your predictor variables.
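As a hedged sketch of this idea, we can reuse the salary example from above with hypothetical simulated data: salary is more likely to be missing for older respondents (MAR), & a simple logistic regression on the missingness indicator stands in for "a classification model":

```r
# Simulate MAR data: the chance of a missing salary grows with age
set.seed(42)
age    <- runif(200, 20, 70)
salary <- 20000 + 500 * age + rnorm(200, sd = 5000)
salary[runif(200) < (age - 20) / 100] <- NA  # missingness depends on age

# Model the missingness indicator from the other variable
fit <- glm(is.na(salary) ~ age, family = binomial)
coef(summary(fit))  # a positive, significant age coefficient suggests MAR
```

If no available variable predicted the missingness indicator better than chance, we would lean towards MCAR instead.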

Note: There is a more powerful imputation technique called multiple imputation where we create many new data sets, replacing missing data with sensible values in each one. We then train a model on each of these imputed data sets & return the average model. While this is the most widely used imputation technique, it isn’t implemented in mlr. Check out the documentation for the mice package.
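As a taste, here is a minimal sketch of multiple imputation with mice, using the small nhanes data set that ships with that package for illustration (this assumes mice is installed; it is not part of mlr):

```r
library(mice)

# Create 5 imputed copies of the data, fit a model to each, & pool
imp  <- mice(nhanes, m = 5, printFlag = FALSE, seed = 1)
fits <- with(imp, lm(chl ~ age + bmi))
summary(pool(fits))
```

The pooled estimates account for the uncertainty introduced by the imputation itself, which single imputation cannot do.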

For our ozone data, we have missing values across several variables, & they’re all continuous. Therefore, we’ll choose a regression algorithm that can handle missing data: rpart. We will impute the missing values with the rpart decision tree algorithm. In previous lessons, we used tree-based algorithms for classification problems, but decision trees can be used to predict continuous variables too.

imputeMethod <- imputeLearner('regr.rpart')
imputeMethod
## $learn
## function (data, target, col, learner, features) 
## {
##     constructor = getTaskConstructorForLearner(learner)
##     if (is.null(features)) {
##         features = setdiff(names(data), target)
##     }
##     else {
##         not.ok = which(features %nin% names(data))
##         if (length(not.ok)) {
##             stopf("Features for imputation not found in data: '%s'", 
##                 collapse(features[not.ok]))
##         }
##         not.ok = which.first(target %in% features)
##         if (length(not.ok)) {
##             stopf("Target column used as feature for imputation: '%s'", 
##                 target[not.ok])
##         }
##         if (col %nin% features) {
##             features = c(col, features)
##         }
##     }
##     impute.feats = setdiff(features, col)
##     if (anyMissing(data[impute.feats]) && !hasLearnerProperties(learner, 
##         "missings")) {
##         has.na = vlapply(data[impute.feats], anyMissing)
##         wrong.feats = clipString(collapse(colnames(data[impute.feats])[has.na], 
##             ", "), 50L)
##         stopf("Feature(s) '%s' used for imputation has/have missing values, but learner '%s' does not support that!", 
##             wrong.feats, learner$id)
##     }
##     ind = !is.na(data[[col]])
##     task = constructor("impute", data = subset(data, subset = ind, 
##         select = features), target = col, check.data = TRUE, 
##         fixup.data = "quiet")
##     list(model = train(learner, task), features = features)
## }
## <bytecode: 0x7f85f31af638>
## <environment: 0x7f85f3774fd8>
## 
## $impute
## function (data, target, col, model, features) 
## {
##     x = data[[col]]
##     ind = is.na(x)
##     if (all(!ind)) {
##         return(x)
##     }
##     newdata = as.data.frame(data)[ind, features, drop = FALSE]
##     p = predict(model, newdata = newdata)$data$response
##     replace(x, ind, p)
## }
## <bytecode: 0x7f85f37721d8>
## <environment: 0x7f85f3774fd8>
## 
## $args
## $args$learner
## Learner regr.rpart from package rpart
## Type: regr
## Name: Decision Tree; Short name: rpart
## Class: regr.rpart
## Properties: missings,numerics,factors,ordered,weights,featimp
## Predict-Type: response
## Hyperparameters: xval=0
## 
## 
## $args$features
## NULL
## 
## 
## attr(,"class")
## [1] "ImputeMethod"
ozoneImp <- impute(as.data.frame(ozoneClean), classes = list(numeric = imputeMethod))
ozoneImp
## $data
##     Month Date Day Ozone Press_height Wind    Humid Temp_Sand Temp_Monte
## 1       1    1   4     3     5480.000    8 20.00000  33.00000   33.98000
## 2       1    2   5     3     5660.000    6 31.00000  38.00000   51.27180
## 3       1    3   6     3     5710.000    4 28.00000  40.00000   51.27180
## 4       1    4   7     5     5700.000    3 37.00000  45.00000   51.27180
## 5       1    5   1     5     5760.000    3 51.00000  54.00000   45.32000
## 6       1    6   2     6     5720.000    4 69.00000  35.00000   49.64000
## 7       1    7   3     4     5790.000    6 19.00000  45.00000   46.40000
## 8       1    8   4     4     5790.000    3 25.00000  55.00000   52.70000
## 9       1    9   5     6     5700.000    3 73.00000  41.00000   48.02000
## 10      1   10   6     7     5700.000    3 59.00000  44.00000   51.27180
## ... (remaining rows omitted)
## 262     9   22   3    13     5800.000    6 73.00000  75.00000   67.10000
## 263     9   24   5     7     5800.000    5 80.00000  65.00000   60.08000
## 264     9   25   6     3     5780.000    9 73.00000  61.00000   51.27180
## 265     9   26   7     5     5790.000    8 80.00000  60.00000   51.27180
## 266     9   27   1    11     5770.000    5 75.00000  64.00000   55.94000
## 267     9   28   2    12     5750.000    4 68.00000  61.00000   59.54000
## 268     9   29   3     5     5640.000    5 93.00000  63.00000   54.32000
## 269     9   30   4     4     5640.000    7 57.00000  62.00000   54.32000
## 270    10    1   5     5     5650.000    3 70.00000  59.00000   50.90000
## 271    10    2   6     4     5710.000    6 65.00000  56.00000   51.27180
## 272    10    3   7    10     5760.000    6 66.00000  59.00000   59.67714
## 273    10    4   1    17     5840.000    4 73.00000  72.00000   63.14000
## 274    10    5   2    26     5880.000    3 77.00000  71.00000   67.64000
## 275    10    6   3    30     5890.000    5 80.00000  75.00000   71.06000
## 276    10    7   4    18     5890.000    4 73.00000  71.00000   70.88000
## 277    10    8   5    12     5890.000    5 19.00000  71.00000   70.52000
## 278    10    9   6     7     5890.000    6 19.00000  73.00000   69.56115
## 279    10   10   7    15     5850.000    3 73.00000  78.00000   69.56115
## 280    10   11   1    12     5830.000    5 76.00000  73.00000   69.56115
## 281    10   12   2     7     5830.000    8 77.00000  71.00000   67.10000
## 282    10   13   3    28     5860.000    5 86.00000  73.00000   69.80000
## 283    10   14   4    22     5830.000    5 76.00000  71.00000   69.44000
## 284    10   15   5    18     5800.000    7 66.00000  66.00000   62.96000
## 285    10   16   6    14     5830.000    4 74.00000  69.00000   59.67714
## 286    10   17   7    24     5790.000    5 71.00000  69.00000   59.67714
## 287    10   18   1    10     5730.000    4 84.00000  64.00000   56.30000
## 288    10   19   2    14     5780.000    5 74.00000  65.00000   63.14000
## 289    10   20   3     9     5740.000    7 48.00000  54.00000   62.96000
## 290    10   21   4    12     5710.000    8 75.00000  62.00000   58.64000
## 291    10   22   5     7     5690.000    6 74.00000  56.00000   52.34000
## 292    10   23   6     7     5670.000    4 67.00000  55.00000   51.27180
## 293    10   24   7     6     5760.000    4 75.00000  58.00000   51.27180
## 294    10   25   1    13     5820.000    5 71.00000  48.00000   59.67714
## 295    10   26   2     5     5790.000    3 35.00000  54.00000   51.27180
## 296    10   27   3     3     5760.000    5 23.00000  57.00000   53.42000
## 297    10   28   4     7     5800.000    6 19.00000  60.00000   57.02000
## 298    10   29   5     8     5810.000    7 59.00000  61.00000   55.76000
## 299    10   30   6    10     5750.000    4 60.00000  63.00000   59.67714
## 300    10   31   7    12     5840.000    0 38.00000  65.00000   59.67714
## 301    11    1   1     7     5860.000    3 21.17949  66.00000   65.12000
## 302    11    2   2     5     5870.000    6 21.17949  68.00000   68.90000
## 303    11    3   3     6     5920.000    3 22.00000  71.00000   69.08000
## 304    11    4   4     4     5900.000    0 21.17949  70.00000   68.72000
## 305    11    5   5     5     5860.000    7 19.00000  70.00000   62.78000
## 306    11    6   6    11     5840.000    3 69.57459  70.00000   59.67714
## 307    11    7   7    20     5840.000    0 45.00000  68.00000   59.67714
## 308    11    8   1     4     5850.000    5 48.72727  64.00000   64.04000
## 309    11    9   2    14     5810.000    2 47.00000  69.00000   60.98000
## 310    11   10   3    16     5770.000    2 73.00000  59.00000   57.20000
## 311    11   11   4     5     5710.000    4 67.00000  49.00000   44.24000
## 312    11   12   5     3     5500.000    9 56.00000  39.00000   41.36000
## 313    11   13   6     5     5660.000    3 54.00000  50.00000   42.06000
## 314    11   14   7     1     5700.000    3 71.00000  46.00000   42.06000
## 315    11   15   1     5     5810.000    5 59.00000  54.00000   54.50000
## 316    11   16   2     4     5860.000    0 25.00000  60.00000   61.52000
## 317    11   17   3    11     5900.000    0 24.00000  62.00000   62.60000
## 318    11   18   4     6     5850.000    5 41.00000  65.00000   59.54000
## 319    11   19   5     8     5780.000    3 50.00000  66.00000   59.72000
## 320    11   20   6    14     5790.000    0 76.00000  66.00000   59.67714
## 321    11   21   7    18     5780.000    2 82.00000  63.00000   59.67714
## 322    11   22   1    12     5770.000    2 81.00000  62.00000   60.62000
## 323    11   23   2     9     5750.000    2 85.00000  60.00000   59.72000
## 324    11   24   3     7     5780.000    5 76.00000  63.00000   60.44000
## 325    11   25   4    14     5790.000    5 66.00000  60.00000   59.67714
## 326    11   26   5     4     5750.000    6 58.00000  58.00000   42.62000
## 327    11   27   6     3     5670.000    8 19.00000  34.00000   33.98000
## 328    11   28   7     3     5760.000    0 19.00000  36.00000   42.06000
## 329    11   29   1     3     5770.000    4 19.00000  44.00000   51.26000
## 330    11   30   2     3     5810.000    2 19.00000  53.00000   55.94000
## 331    12    1   3     3     5810.000    2 19.00000  52.00000   57.74000
## 332    12    2   4     3     5870.000    3 19.00000  53.00000   60.80000
## 333    12    3   5     3     5830.000    2 27.00000  58.00000   59.00000
## 334    12    4   6     6     5760.000    0 64.00000  55.00000   51.27180
## 335    12    5   7     6     5680.000    0 52.00000  50.00000   59.67714
## 336    12    6   1     5     5780.000    4 19.00000  48.00000   54.14000
## 337    12    7   2     3     5810.000    3 19.00000  51.00000   58.28000
## 338    12    8   3     4     5760.000    0 32.00000  62.00000   56.12000
## 339    12    9   4     7     5680.000    0 58.00000  40.00000   46.94000
## 340    12   10   5     5     5750.000    0 26.00000  44.00000   52.88000
## 341    12   11   6     5     5790.000    5 19.00000  49.00000   51.27180
## 342    12   12   7     4     5770.000    3 19.00000  53.00000   51.27180
## 343    12   13   1     3     5750.000    0 19.00000  53.00000   51.98000
## 344    12   14   2     2     5720.000    0 19.00000  53.00000   52.70000
## 345    12   15   3     5     5760.000    3 19.00000  55.00000   58.10000
## 346    12   16   4     3     5780.000    0 19.00000  51.00000   54.32000
## 347    12   17   5     4     5660.000    4 19.00000  54.00000   49.64000
## 348    12   18   6     4     5610.000    2 58.00000  48.00000   42.06000
## 349    12   19   7     6     5640.000    0 51.00000  53.00000   42.06000
## 350    12   20   1     6     5680.000    3 52.00000  49.00000   48.38000
## 351    12   21   2     3     5650.000    5 19.00000  48.00000   47.12000
## 352    12   22   3     4     5710.000    4 19.00000  51.00000   51.08000
## 353    12   23   4     3     5680.000    4 57.00000  47.00000   45.32000
## 354    12   24   5     8     5630.000    4 50.00000  50.00000   51.27180
## 355    12   25   6     5     5770.000    0 21.17949  49.00000   59.67714
## 356    12   26   7     3     5800.000    7 19.00000  51.00000   59.67714
## 357    12   27   1     2     5730.000    3 53.00000  51.00000   49.28000
## 358    12   28   2     3     5690.000    3 23.00000  51.00000   49.28000
## 359    12   29   3     5     5650.000    3 61.00000  50.00000   46.58000
## 360    12   30   4     1     5550.000    4 85.00000  39.00000   41.00000
## 361    12   31   5     2     5645.556    4 68.00000  37.00000   33.98000
##     Inv_height Press_grad Inv_temp Visib
## 1     5000.000     -15.00 30.56000   200
## 2     4960.907     -14.00 38.24255   300
## 3     2693.000     -25.00 47.66000   250
## 4      590.000     -24.00 55.04000   100
## 5     1450.000      25.00 57.02000    60
## 6     1568.000      15.00 53.78000    60
## 7     2631.000     -33.00 54.14000   100
## 8      554.000     -28.00 64.76000   250
## 9     2083.000      23.00 52.52000   120
## 10    2654.000      -2.00 48.38000   120
## 11    5000.000     -19.00 48.56000   120
## 12     111.000       9.00 63.14000   150
## 13     492.000     -44.00 64.58000    40
## 14    5000.000     -44.00 56.30000   200
## 15    1249.000     -53.00 75.74000   250
## 16    5000.000     -67.00 65.48000   200
## 17    5000.000     -40.00 63.32000   200
## 18     639.000       1.00 66.02000   150
## 19     393.000     -68.00 69.80000    10
## 20    5000.000     -66.00 54.68000   140
## 21    5000.000     -58.00 51.98000   250
## 22    5000.000     -26.00 51.98000   200
## 23    3044.000      18.00 52.88000   150
## 24    3641.000      23.00 47.66000   140
## 25     111.000     -10.00 59.54000    50
## 26     692.000     -25.00 67.10000     0
## 27     597.000     -52.00 70.52000    70
## 28    4106.941     -44.00 62.58364   150
## 29    1791.000     -15.00 64.76000   150
## 30     793.000     -15.00 65.84000   120
## 31     531.000     -38.00 75.92000    40
## 32     419.000     -29.00 75.74000   120
## 33     816.000      -7.00 66.20000     6
## 34    3651.000      62.00 49.10000    30
## 35    5000.000      70.00 37.94000   100
## 36    5000.000      28.00 32.36000   200
## 37    1341.000      18.00 45.86000    60
## 38    5000.000       0.00 38.66000   350
## 39    3799.000     -18.00 45.86000   250
## 40    5000.000      32.00 38.12000   350
## 41    5000.000      -1.00 37.58000   300
## 42    5000.000     -30.00 45.86000   300
## 43    5000.000      -8.00 45.50000   300
## 44    2398.000      21.00 53.78000   200
## 45    5000.000      51.00 36.32000   100
## 46    4281.000      42.00 41.36000   250
## 47    1161.000      27.00 52.88000   200
## 48    2778.000       2.00 55.76000   200
## 49     442.000      26.00 58.28000    40
## 50    1235.280      82.00 38.24255     2
## 51    5000.000     -30.00 42.26000   300
## 52    5000.000     -53.00 43.88000   300
## 53    5000.000     -43.00 49.10000   300
## 54    5000.000       7.00 49.10000   300
## 55    5000.000      24.00 42.08000   300
## 56    1341.000      19.00 59.18000   150
## 57    1318.000       2.00 64.58000   150
## 58     885.000      -4.00 67.10000    80
## 59     360.000       3.00 67.10000    40
## 60    3497.000      73.00 49.46000    40
## 61    5000.000      73.00 40.10000    80
## 62    5000.000      44.00 29.30000   300
## 63    5000.000      39.00 27.50000   200
## 64    5000.000      15.00 30.02000   500
## 65    5000.000     -12.00 33.62000   140
## 66    5000.000      -2.00 39.02000   140
## 67    5000.000      30.00 42.08000   140
## 68    3608.000      24.00 39.38000   100
## 69    5000.000      38.00 32.90000   140
## 70    5000.000      56.00 35.60000   200
## 71    5000.000      66.00 34.34000   120
## 72     613.000     -27.00 59.72000   300
## 73     334.000      -9.00 64.40000   300
## 74     567.000      13.00 61.88000   150
## 75     488.000     -20.00 64.94000     2
## 76     531.000     -15.00 71.06000    50
## 77     508.000       7.00 66.56000    70
## 78    1571.000      68.00 56.30000    17
## 79     721.000      28.00 55.40000   140
## 80     505.000     -49.00 67.28000   140
## 81     377.000     -27.00 73.22000   300
## 82     442.000      -9.00 75.74000   200
## 83     902.000      54.00 60.44000   250
## 84    3188.000      53.00 58.64000    80
## 85    1381.000       4.00 56.30000    60
## 86    5000.000     -16.00 50.00000   100
## 87    5000.000      38.00 46.94000   150
## 88    1302.000      40.00 52.70000   150
## 89    1292.000      -5.00 53.60000   200
## 90    5000.000     -14.00 52.70000   100
## 91     472.000      34.00 62.96000   300
## 92    1404.000      42.00 54.50000   120
## 93     944.000      35.00 55.76000   100
## 94    5000.000      75.00 35.24000   200
## 95    5000.000      41.00 30.92000   200
## 96    5000.000      62.00 33.44000   200
## 97    5000.000      44.00 42.08000   300
## 98    2014.000      31.00 53.42000   300
## 99    5000.000      56.00 37.22000   250
## 100   5000.000      27.00 47.66000   120
## 101    524.000      57.00 54.68000   140
## 102   5000.000      55.00 38.48000   200
## 103   5000.000      59.00 35.24000   140
## 104   5000.000      24.00 32.54000    80
## 105   2490.000      29.00 47.48000   300
## 106   5000.000     107.00 31.28000   100
## 107   5000.000      36.00 33.44000   300
## 108   5000.000      28.00 39.38000   200
## 109   1144.000      30.00 53.60000   120
## 110    547.000       1.00 66.92000   100
## 111    413.000      10.00 69.62000   120
## 112    610.000      46.00 63.68000    60
## 113   3638.000      81.00 51.26000   120
## 114   3848.000      45.00 56.84000   100
## 115   1479.000      40.00 68.00000   100
## 116   1108.000      55.00 65.48000    27
## 117    869.000       0.00 58.10000    40
## 118   5000.000      43.00 38.30000   140
## 119   5000.000      49.00 37.94000   150
## 120   1148.000      31.00 60.80000   100
## 121    856.000       4.00 75.38000   100
## 122    807.000      16.00 73.04000   120
## 123   2040.000      46.00 63.50000   150
## 124   5000.000      63.00 42.62000   100
## 125    314.000      60.00 59.00000   120
## 126   5000.000      77.00 42.62000    80
## 127   5000.000      75.00 40.82000   120
## 128   1410.000      20.00 55.22000   140
## 129    360.000      23.00 62.42000   120
## 130   1568.000      32.00 67.64000    70
## 131   1184.000      40.00 68.72000    80
## 132    898.000      24.00 73.40000    70
## 133    436.000       0.00 86.36000    40
## 134    774.000       6.00 86.00000    20
## 135   1181.000      50.00 79.88000    17
## 136   1991.000      47.00 69.62000    40
## 137   2057.000      71.00 67.28000    50
## 138   1597.000      56.00 68.00000    50
## 139   1184.000      52.00 69.44000    70
## 140   3005.000      58.00 59.18000    80
## 141   2880.000      53.00 57.38000   120
## 142   2116.600      66.00 56.30000   120
## 143   2125.000      64.00 59.00000   100
## 144   3720.000      74.00 50.90000   120
## 145   4337.000      66.00 59.36000   200
## 146   2053.000      31.00 72.86000   120
## 147   1958.000      70.00 70.52000    40
## 148   3644.000      86.00 59.36000    70
## 149   1368.000      75.00 58.46000   100
## 150   3539.000      73.00 53.60000   120
## 151   2785.000      49.00 63.32000   100
## 152    984.000      26.00 69.26000   120
## 153   1804.000      56.00 68.00000    70
## 154   3234.000      77.00 62.78000    80
## 155   3441.000      67.00 60.98000   100
## 156   1578.000      61.00 60.80000   100
## 157   1850.000      76.00 60.80000   120
## 158   2962.000      80.00 59.36000   120
## 159   2670.000      54.00 55.40000   120
## 160   5000.000      76.00 42.08000   150
## 161   5000.000      57.00 40.82000   140
## 162   5000.000      46.00 51.26000   140
## 163    987.000      28.00 63.86000   140
## 164   1148.000      43.00 66.92000   140
## 165    898.000     -24.00 77.90000    60
## 166    777.000      -1.00 82.58000    30
## 167   1279.000      75.00 71.60000    17
## 168   1046.000      69.00 68.72000    80
## 169   1167.000      50.00 74.30000    60
## 170    987.000      45.00 75.74000   100
## 171   1144.000      57.00 71.24000   120
## 172    977.000      60.00 70.70000   150
## 173    770.000      26.00 75.56000   120
## 174    629.000     -11.00 86.36000   140
## 175    337.000     -14.00 89.78000   140
## 176    590.000      26.00 85.10000   120
## 177    400.000      19.00 83.30000   120
## 178    580.000       9.00 87.26000    80
## 179    646.000      25.00 89.24000   140
## 180    826.000      41.00 84.38000   140
## 181    823.000      52.00 74.48000   150
## 182   2116.000      47.00 70.34000   120
## 183   2972.000      52.00 64.40000   120
## 184   2752.000      41.00 69.98000   140
## 185   1377.000      37.00 78.44000   100
## 186   1486.000      33.00 79.88000    50
## 187    990.000      22.00 85.10000    40
## 188    508.000      29.00 85.28000   100
## 189   1204.000      56.00 79.88000   100
## 190   2414.000      63.00 76.46000    60
## 191   2385.000      67.00 70.34000    50
## 192   2326.000      64.00 71.24000    70
## 193   3389.000      56.00 68.72000    80
## 194   2818.000      58.00 71.78000    80
## 195   3083.000      75.00 72.32000    80
## 196   2394.000      54.00 69.62000    90
## 197   2746.000      61.00 69.44000   120
## 198   2493.000      55.00 72.50000   120
## 199   1528.000      42.00 73.94000   100
## 200    111.000      40.00 78.08000    60
## 201   1899.000      45.00 76.46000    40
## 202   1289.000      32.00 75.20000    40
## 203    984.000      35.00 78.80000    70
## 204    836.000      28.00 81.50000    80
## 205    826.000      27.00 79.34000    80
## 206   1105.000      39.00 74.12000    80
## 207   1023.000      46.00 77.18000    80
## 208   1453.000      68.00 70.16000    80
## 209   2375.000      52.00 66.20000   100
## 210   2956.000      46.00 67.28000   120
## 211   2988.000      56.00 65.66000   150
## 212   4291.000      60.00 62.24000   200
## 213   3330.000      59.00 58.64000   150
## 214   2085.200      31.00 63.27119   150
## 215   2292.951      35.00 76.70686   150
## 216   2085.200      49.00 63.27119   150
## 217    829.775      30.00 76.70686   100
## 218   2292.951      43.00 63.27119   100
## 219   1233.000      30.00 70.52000   100
## 220   1450.000      36.00 69.80000    30
## 221   1069.000      28.00 74.30000    80
## 222    984.000      57.00 73.40000    70
## 223   1653.000      71.00 68.72000    60
## 224   3930.000      68.00 59.18000   150
## 225   5000.000      73.00 51.62000   200
## 226   5000.000      45.00 51.26000   200
## 227   4212.000      46.00 56.84000   200
## 228   5000.000      52.00 49.82000   250
## 229   5000.000      31.00 57.38000   300
## 230   1545.000      43.00 65.66000    70
## 231    994.000      44.00 69.62000   300
## 232   1125.000      55.00 68.00000   150
## 233    636.000      16.00 73.94000   300
## 234    748.000      32.00 77.00000    30
## 235    692.000      44.00 77.72000   100
## 236    807.000      39.00 78.80000   100
## 237    869.000      19.00 78.98000    17
## 238    800.000      24.00 85.64000    20
## 239    393.000       6.00 91.76000     4
## 240    557.000       0.00 90.68000    70
## 241    620.000      27.00 85.64000    30
## 242   1404.000      33.00 84.74000    70
## 243    898.000      21.00 80.60000    60
## 244    377.000      -2.00 83.30000    40
## 245    528.000      17.00 78.80000    50
## 246   2818.000      26.00 72.68000    70
## 247   3247.000      10.00 67.28000   140
## 248    895.000       0.00 78.08000   100
## 249    721.000       0.00 80.24000   120
## 250    774.000     -27.00 75.56000   100
## 251    134.000       0.00 77.18000    70
## 252   5000.000      39.00 51.80000   150
## 253   1965.000      13.00 60.98000    50
## 254   1853.000      10.00 70.88000    70
## 255   2342.000       7.00 71.42000    40
## 256   5000.000      56.00 51.62000    70
## 257   5000.000      37.00 47.48000   120
## 258   4028.000      35.00 55.04000   140
## 259   2716.000      26.00 63.68000   140
## 260   3671.000      31.00 65.84000   100
## 261   3795.000      31.00 66.92000    70
## 262   3120.000      35.00 66.92000    40
## 263   2667.000      17.00 63.50000   100
## 264   5000.000      39.00 52.70000   120
## 265   5000.000      36.00 48.92000   120
## 266    308.000      25.00 68.72000   140
## 267   2982.000      18.00 59.90000   120
## 268   5000.000      30.00 52.70000    70
## 269   5000.000      25.00 51.26000   150
## 270   5000.000      38.00 47.66000   200
## 271   5000.000      35.00 47.84000   200
## 272   3070.000      13.00 60.08000   200
## 273    830.000       0.00 72.14000    70
## 274    711.000      -9.00 75.56000    40
## 275   1049.000     -10.00 78.98000    50
## 276    511.000     -39.00 83.84000    17
## 277   5000.000     -40.00 67.64000    80
## 278   5000.000     -34.00 69.44000   250
## 279    377.000      -3.00 78.80000   200
## 280    862.000      27.00 73.58000     2
## 281    337.000     -17.00 81.14000    20
## 282    492.000      -2.00 82.22000     7
## 283   1394.000      13.00 75.02000    30
## 284   3146.000      27.00 64.04000    50
## 285   2234.000      11.00 66.74000    70
## 286   2109.000      21.00 69.62000    17
## 287   5000.000      23.00 54.50000    80
## 288   2270.000      -7.00 68.90000    50
## 289   2191.000     -13.00 68.72000    60
## 290   3448.000      12.00 58.64000    60
## 291   5000.000      13.00 48.92000    80
## 292   5000.000      11.00 49.46000    50
## 293   2719.000      25.00 56.84000    50
## 294   1899.000      21.00 62.06000    40
## 295   5000.000     -41.00 52.52000    40
## 296   5000.000     -21.00 50.90000   300
## 297   5000.000     -19.00 54.32000   200
## 298   2385.000      10.00 60.44000   150
## 299   1938.000       0.00 62.60000   100
## 300    590.000     -11.00 69.98000   100
## 301   2085.200     -32.00 62.58364    60
## 302   2085.200     -42.00 62.58364   150
## 303    328.000     -40.00 80.60000   150
## 304   2085.200     -43.00 62.58364   200
## 305   5000.000     -29.00 61.70000   300
## 306    829.775      -9.00 76.70686   120
## 307    597.000     -22.00 73.58000    30
## 308   2085.200     -25.00 62.58364   100
## 309    469.000      -4.00 71.78000    50
## 310   1541.000      18.00 63.14000    20
## 311   5000.000      24.00 41.90000   200
## 312   5000.000      15.00 41.72000   120
## 313   5000.000      27.00 44.60000   300
## 314   5000.000      54.00 42.80000   200
## 315   5000.000     -28.00 53.60000    70
## 316   5000.000     -38.00 63.50000   140
## 317   5000.000     -36.00 60.08000   150
## 318   2014.000     -20.00 69.98000   200
## 319    436.000       1.00 70.34000     4
## 320    830.000       3.00 66.02000    40
## 321   1112.000      -8.00 66.38000    30
## 322   1210.000     -17.00 67.82000    30
## 323    501.000     -22.00 70.88000     2
## 324    875.000     -15.00 68.90000     0
## 325   1601.000       7.00 62.06000    30
## 326   5000.000      59.00 41.90000    60
## 327   5000.000     -63.00 37.04000   150
## 328   5000.000     -52.00 41.00000   100
## 329   2280.000     -54.00 55.76000   250
## 330   2047.000     -43.00 63.50000   150
## 331   5000.000     -69.00 56.48000   200
## 332   3720.000     -50.00 61.34000   200
## 333    311.000     -24.00 69.98000   200
## 334   2536.000      28.00 56.48000    80
## 335   1154.000     -22.00 61.52000    60
## 336   2933.000     -40.00 59.90000   300
## 337   3064.000     -33.00 62.78000   200
## 338    826.000     -16.00 64.76000   300
## 339   5000.000       2.00 42.98000    50
## 340    111.000     -52.00 68.18000    40
## 341   5000.000     -48.00 54.68000    70
## 342   5000.000     -37.00 55.58000   150
## 343   5000.000     -26.00 51.08000   150
## 344   5000.000     -31.00 51.44000    70
## 345    948.000     -48.00 70.70000   200
## 346   5000.000     -50.00 50.90000   120
## 347   5000.000     -22.00 48.56000   150
## 348   3687.000     -10.00 46.94000   150
## 349   5000.000       0.00 44.24000    60
## 350   5000.000     -19.00 45.68000    70
## 351   5000.000     -28.00 45.32000   150
## 352   5000.000     -25.00 48.38000   300
## 353    508.000     -10.00 58.64000   100
## 354   2851.000      -5.00 50.00000    70
## 355   4106.941     -35.00 52.84870    40
## 356   3143.000     -37.22 60.26000   140
## 357    111.000     -14.00 72.50000   200
## 358   5000.000     -36.00 51.26000    70
## 359   3704.000      18.00 46.94000    40
## 360   5000.000       8.00 39.92000   100
## 361   5000.000      -3.00 37.22000    70
## 
## $desc
## Imputation description
## Target: 
## Features: 13; Imputed: 13
## impute.new.levels: TRUE
## recode.factor.levels: TRUE
## dummy.type: factor

We first use the imputeLearner() function to define what algorithm we’re going to use to impute the missing values. The only argument we supply to this function is the name of the learner, which in this case is 'regr.rpart'.

Next, we use the impute() function to create the imputed data set; its first argument is the data. We’ve wrapped our tibble inside the as.data.frame() function just to prevent repeated warnings about the data being a tibble & not a data frame (these can be safely ignored). We can specify different imputation techniques for different columns by supplying a named list to the cols argument. For example, we could say cols = list(var1 = imputeMean(), var2 = imputeLearner('regr.lm')). We can also specify different imputation techniques for different classes of variable (one technique for numeric variables, another for factors) using the classes argument in the same way. In the preceding listing, we used the classes argument to impute all the variables (they are all numeric) with the imputation learner we defined.
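To make those two steps concrete, the imputation code might look like the following sketch (the name ozoneClean for the cleaned, all-numeric tibble is an assumption for illustration):

```r
library(mlr)

# Define the learner used to predict missing values: an rpart regression tree
imputeMethod <- imputeLearner('regr.rpart')

# Impute every numeric column (all of them) with predictions from that learner.
# as.data.frame() avoids repeated tibble-related warnings from mlr.
ozoneImp <- impute(as.data.frame(ozoneClean),
                   classes = list(numeric = imputeMethod))

# The imputed data can then be accessed as ozoneImp$data
```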

This results in a data set we can access using ozoneImp$data, whose missing values have been replaced with predictions from a model learned by the rpart algorithm. Now we can define our task & learner using the imputed data set. By supplying 'regr.lm' as an argument to the makeLearner() function, we’re telling mlr that we want to use linear regression.

ozoneTask <- makeRegrTask(data = ozoneImp$data, target = 'Ozone')
ozoneTask
## Supervised task: ozoneImp$data
## Type: regr
## Target: Ozone
## Observations: 361
## Features:
##    numerics     factors     ordered functionals 
##          12           0           0           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
lin <- makeLearner('regr.lm')
lin
## Learner regr.lm from package stats
## Type: regr
## Name: Simple Linear Regression; Short name: lm
## Class: regr.lm
## Properties: numerics,factors,se,weights
## Predict-Type: response
## Hyperparameters:

Automating Feature Selection

Sometimes, it may be obvious which variables have no predictive value & can be removed from the analysis. Domain knowledge is also very important here: as experts, we may include variables in the model that we believe to have some predictive value for the outcome we’re studying. But it’s often better to take a less subjective approach to feature selection, & allow an algorithm to choose the relevant features for us.

There are two methods for automating feature selection:

  • Filter methods – Filter methods compare each of the predictors against the outcome variable, & calculate a metric of how much the outcome varies with the predictor. This metric could be, for example, a correlation if both variables are continuous. The predictor variables are ranked in order of this metric (which, in theory, ranks them in order of how much information they can contribute to the model), & we can choose to drop a certain number or proportion of the worst-performing variables from our model. The number or proportion of variables we drop can be tuned as a hyperparameter during model building.
  • Wrapper methods – With wrapper methods, rather than using a single, out-of-model statistic to estimate feature importance, we iteratively train our model with different combinations of predictor variables. Eventually, the combination of predictors that gives us the best-performing model is chosen. There are different ways of doing this, but one example is sequential forward selection. In sequential forward selection, we start with no predictors & then add predictors one by one. At each step of the algorithm, the feature that results in the best model performance is chosen. Finally, when the addition of any more predictors doesn’t result in an improvement in performance, feature addition stops & the final model is trained on the selected predictors.
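
Sequential forward selection can be sketched in a few lines of base R. This is an illustrative toy only (the function name forward_select is made up, & adjusted R-squared stands in for cross-validated performance); mlr automates the real thing for us.

```r
# Toy sequential forward selection: greedily add the predictor that most
# improves adjusted R-squared of a linear model, stopping when nothing helps.
forward_select <- function(data, target) {
  remaining  <- setdiff(names(data), target)
  chosen     <- character(0)
  best_score <- -Inf
  repeat {
    # Score each candidate feature when added to the current set
    scores <- sapply(remaining, function(feat) {
      f <- reformulate(c(chosen, feat), response = target)
      summary(lm(f, data = data))$adj.r.squared
    })
    if (length(scores) == 0 || max(scores) <= best_score) break
    best_feat  <- remaining[which.max(scores)]
    chosen     <- c(chosen, best_feat)
    remaining  <- setdiff(remaining, best_feat)
    best_score <- max(scores)
  }
  chosen
}
```

Note that a fresh model is fitted for every candidate at every step, which is exactly why wrapper methods get expensive on large predictor sets.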

Which method do we choose? It boils down to this: wrapper methods may result in models that perform better, because we’re actually using the model we’re training to estimate predictor importance. However, because we’re training a fresh model at each iteration of the selection process (& each step may include other preprocessing steps such as imputation), wrapper methods tend to be computationally expensive. Filter methods, on the other hand, may or may not select the best-performing set of predictors but are much less computationally expensive.

The Filter Method for Feature Selection

We’ll start with an example of using the filter method. There are a number of metrics we can use to estimate predictor importance. To see the list of available filter methods built into mlr, run listFilterMethods(). Some of the filter methods will require you to first install the FSelector package. There are too many filter methods to describe in full, but common choices include:

  • Linear correlation – when both predictor & outcome are continuous
  • ANOVA – when the predictor is categorical & the outcome is continuous
  • Chi-squared – when both predictor & outcome are categorical
  • Random forest importance – can be used whether the predictors & outcome are categorical or continuous

The default method used by mlr (because it doesn’t depend on whether the variables are categorical or continuous) is to build a random forest to predict the outcome & return the variables that contribute most to model predictions (using the out-of-bag error). In our ozone example, because both the predictors & outcome variable are continuous, we’ll use linear correlation to estimate variable importance.

First, we use the generateFilterValuesData() function to generate an importance metric for each predictor. The first argument is the task, which contains our dataset & lets the function know that Ozone is the target variable. The second, optional argument is method, to which we supply one of the methods listed by listFilterMethods(). In this example, we use 'linear.correlation'. By extracting the $data component of this object, we get the table of predictors with their Pearson correlation coefficients.

We’ll also plot the coefficients for easier interpretability with plotFilterValues().

filterVals <- generateFilterValuesData(ozoneTask, 
                                       method = 'linear.correlation')
filterVals
## FilterValues:
## Task: ozoneImp$data
##             name    type             filter      value
##  1:    Temp_Sand numeric linear.correlation 0.76977674
##  2:   Temp_Monte numeric linear.correlation 0.74159034
##  3:     Inv_temp numeric linear.correlation 0.72712691
##  4: Press_height numeric linear.correlation 0.58752354
##  5:   Inv_height numeric linear.correlation 0.57563391
##  6:        Humid numeric linear.correlation 0.45148094
##  7:        Visib numeric linear.correlation 0.41471463
##  8:   Press_grad numeric linear.correlation 0.23331823
##  9:         Date numeric linear.correlation 0.08205094
## 10:        Month numeric linear.correlation 0.05371420
## 11:          Day numeric linear.correlation 0.04151444
## 12:         Wind numeric linear.correlation 0.00468138
plotFilterValues(filterVals) + theme_bw()

Now that we have a way of ranking our predictors by their estimated importance, we can decide how to ‘skim off’ the least informative ones. We do this using the filterFeatures() function, which takes the task as the first argument, our filterVals object as the fval argument, & one of the abs, perc, or threshold arguments. The abs argument specifies the absolute number of best predictors to retain; perc specifies the top percentage of best predictors to retain; & threshold specifies a value of our filtering metric (in this case, the correlation coefficient) that a predictor must exceed in order to be retained. We could manually filter out predictors using one of these three methods, but we can also have the computer learn the best predictors to retain: by wrapping together our learner (linear regression) & filter method, we can treat any of abs, perc, & threshold as hyperparameters & tune them.
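
For example, if we wanted to filter manually rather than tune, any of the three criteria could be applied like this (a sketch assuming the ozoneTask & filterVals objects created above; note that mlr spells the percentage argument perc):

```r
# Keep the 6 predictors with the highest correlation with Ozone
filterFeatures(ozoneTask, fval = filterVals, abs = 6)

# Keep the top 25% of predictors
filterFeatures(ozoneTask, fval = filterVals, perc = 0.25)

# Keep predictors whose correlation coefficient exceeds 0.2
filterFeatures(ozoneTask, fval = filterVals, threshold = 0.2)
```

Each call returns a new task containing only the predictors that pass the criterion; none of them modifies ozoneTask in place.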

To wrap together our learner & filter method, we use the makeFilterWrapper() function, supplying the linear regression learner we defined as the learner argument & our filter metric as the fw.method argument.

filterWrapper <- makeFilterWrapper(learner = lin, 
                                   fw.method = 'linear.correlation')

When we wrap together a learner & a preprocessing step, the hyperparameters of both become available for tuning as part of the wrapped learner. In this situation, it means we can tune the fw.abs, fw.perc, or fw.threshold hyperparameter using cross-validation, to select the best-performing features. In this example, we’re going to tune the absolute number of features to retain.

First, we define the hyperparameter space with makeParamSet() & define fw.abs as an integer between 1 & 12 (the minimum & maximum number of features we’re going to retain). Next, we define our grid search, using makeTuneControlGrid(). This will try every value of our hyperparameter. We define an ordinary 10-fold cross-validation strategy using makeResampleDesc() & then perform the tuning with tuneParams(). The first argument is our wrapper learner, & then we supply our task, cross-validation method, hyperparameter space, & search procedure.

lmParamSpace <- makeParamSet(
  makeIntegerParam('fw.abs', lower = 1, upper = 12)
)

gridSearch <- makeTuneControlGrid()
kFold <- makeResampleDesc('CV', iters = 10)
tunedFeats <- tuneParams(filterWrapper, task = ozoneTask, resampling = kFold, 
                          par.set = lmParamSpace, control = gridSearch)
tunedFeats
## Tune result:
## Op. pars: fw.abs=10
## mse.test.mean=20.6510640

Our tuning procedure picks the 10 predictors with the highest correlation with ozone as the best-performing combination. But what does mse.test.mean mean? Well, the performance metrics we used for classification, such as mean misclassification error, don’t make any sense when predicting continuous variables. For regression problems, there are three commonly used performance metrics:

  • Mean absolute error (MAE) - finds the absolute residual between each case & the model, adds them all up, & divides by the number of cases. We can interpret this as the mean absolute distance of the cases from the model.
  • Mean square error (MSE) - similar to MAE but squares the residuals before finding their mean. This makes MSE more sensitive to outliers than MAE, because a residual’s squared value grows quadratically with its distance from the model prediction. MSE is the default performance metric for regression learners in mlr.
  • Root mean square error (RMSE) - Because MSE squares the residual, its value isn’t on the same scale as the outcome variable. Instead, we take the square root of MSE to get RMSE. When tuning hyperparameters & comparing models, MSE & RMSE will always select the same models (because RMSE is simply a transformation of MSE), but RMSE has the benefit of being on the same scale as our outcome variable & so is more interpretable.
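
To make the definitions concrete, here is how the three metrics relate to one another, computed on a handful of made-up observed & predicted values (not from the ozone model):

```r
# Observed & predicted values for five hypothetical cases
observed  <- c(3, 5, 6, 4, 7)
predicted <- c(2.5, 5.5, 4.0, 4.5, 9.0)
resids    <- observed - predicted  # 0.5, -0.5, 2, -0.5, -2

mae  <- mean(abs(resids))  # mean absolute error: 1.1
mse  <- mean(resids^2)     # mean square error: 1.75
rmse <- sqrt(mse)          # root mean square error: ~1.32
```

Notice that the two largest residuals contribute 8 of the 8.75 total squared error but only 4 of the 5.5 total absolute error, illustrating MSE’s greater sensitivity to outlying cases.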

Using the MSE performance metric, our tuned filter method has concluded that retaining the 10 features with the highest correlation with the ozone level results in the best-performing model. We can now train a final model on a task that includes only these top 10 features.

First, we create a new task that includes only the filtered features, using the filterFeatures() function. To this function, we supply the name of the existing task, the filterVals object, & the number of features to retain as the argument to abs. This value can be accessed as the $x component of tunedFeats & needs to be wrapped in unlist(); otherwise, the function will throw an error. This creates a new task that contains only the filtered predictors & retains Ozone as the target variable. Finally, we train the linear model using this task.

filteredTask <- filterFeatures(ozoneTask, fval = filterVals, 
                               abs = unlist(tunedFeats$x))
filteredTask
## Supervised task: ozoneImp$data
## Type: regr
## Target: Ozone
## Observations: 361
## Features:
##    numerics     factors     ordered functionals 
##          10           0           0           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
filteredModel <- train(lin, filteredTask)
filteredModel
## Model for learner.id=regr.lm; learner.class=regr.lm
## Trained on: task.id = ozoneImp$data; obs = 361; features = 10
## Hyperparameters:

The Wrapper Method for Feature Selection

With the filter method, we generate univariate statistics describing how each predictor relates to the outcome variable. This may result in selecting the most informative predictors, but it isn’t guaranteed. Instead, we can use the actual model we’re trying to train to determine which features help it make the best predictions, but this is computationally more expensive, as we train a fresh model for every candidate combination of predictor variables.

There are four options to search for the best combination of predictors.

  • Exhaustive search - Like grid search, this tries every possible combination of predictor variables in our dataset & selects the one that performs best. This is guaranteed to find the best combination but can be prohibitively slow: the number of combinations doubles with every predictor we add. In our 12-predictor dataset, an exhaustive search would need to try \(2^{12} - 1 = 4095\) different variable combinations; with 50 predictors, it would need more than \(10^{15}\).
  • Random search - Like random search from hyperparameter tuning, we define a number of iterations & randomly select feature combinations. The best combination after the final iteration wins. This is usually less intensive (depending on how many iterations you choose), but it isn’t guaranteed to find the best combination of features.
  • Sequential search - From a particular starting point, we add or remove at each step the feature that most improves performance. This can be one of the following:
    • Forward search - We start with an empty model & sequentially add the feature that improves the model most until additional features no longer improve the performance.
    • Backward search - We start with all the features & remove the feature whose removal improves the model the most until additional removals no longer improve performance.
    • Floating forward search - Starting from an empty model, we either add one variable or remove one variable at each step, whichever improves the model the most, until neither an addition nor a removal improves model performance.
    • Floating backward search - The same as floating forward, except with a full model.
  • Genetic algorithm - This method, inspired by Darwinian evolution, finds pairs of feature combinations that act as ‘parents’ to ‘offspring’ variable combinations, which inherit the best-performing features. This method is cool, but can be computationally expensive as the feature space grows.
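
To make the sequential idea concrete, here is a minimal sketch of forward search using plain lm() & a simple holdout split to score each candidate (a hypothetical helper, not how mlr implements it; mlr scores candidates with cross-validation instead):

```r
# Hypothetical forward_search() helper: a sketch, not part of mlr.
# dat is a data frame; outcome is the name of the outcome column.
forward_search <- function(dat, outcome) {
  # Simple holdout split to score candidates (mlr uses cross-validation)
  test_idx  <- sample(nrow(dat), nrow(dat) %/% 3)
  train     <- dat[-test_idx, ]
  test      <- dat[test_idx, ]
  remaining <- setdiff(names(dat), outcome)
  selected  <- character(0)
  best_mse  <- Inf
  repeat {
    if (length(remaining) == 0) break
    # Holdout MSE for each candidate added to the current selection
    mses <- vapply(remaining, function(p) {
      f   <- reformulate(c(selected, p), response = outcome)
      mod <- lm(f, data = train)
      mean((test[[outcome]] - predict(mod, newdata = test))^2)
    }, numeric(1))
    if (min(mses) >= best_mse) break  # no candidate improves performance
    best_mse  <- min(mses)
    selected  <- c(selected, names(which.min(mses)))
    remaining <- setdiff(remaining, selected)
  }
  selected
}
```

Backward search inverts the loop (start with everything, drop the most helpful removal), & the floating variants allow both moves at every step.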

With so many options to choose from, where do we start? Exhaustive & genetic search are prohibitively slow for a large feature space. While random search can alleviate this problem, sequential search is a good compromise between computational cost & the probability of finding the best-performing feature combination. For this example, we’re going to use floating backward selection.

First, we define the search method using the makeFeatSelControlSequential() function. We use 'sfbs' as the method argument to select sequential floating backward selection. Then we use the selectFeatures() function to perform feature selection. To this function, we supply the learner, task, cross-validation strategy, & search method. When we run the function, each candidate combination of predictor variables is cross-validated using our kFold strategy to estimate its performance.

featSelControl <- makeFeatSelControlSequential(method = 'sfbs')
selFeats <- selectFeatures(learner = lin, task = ozoneTask, 
                           resampling = kFold, control = featSelControl)
selFeats
## FeatSel result:
## Features (6): Month, Press_height, Humid, Temp_Sand, Temp_Mon...
## mse.test.mean=20.3337104

By printing the result of this process, we can see that the algorithm selected six predictors that yield a slightly lower MSE than the predictors selected by our filter method.

Now, just as we did for the filter method, we can create a new task using the imputed data that contains only those selected predictors, & train a model on it.

ozoneSelFeat <- ozoneImp$data[, c('Ozone', selFeats$x)]
ozoneSelFeatTask <- makeRegrTask(data = ozoneSelFeat, target = 'Ozone')
wrapperModel <- train(lin, ozoneSelFeatTask)
wrapperModel
## Model for learner.id=regr.lm; learner.class=regr.lm
## Trained on: task.id = ozoneSelFeat; obs = 361; features = 6
## Hyperparameters:

Including Imputation & Feature Selection in Cross-Validation

It’s been said many times before, but always include data-dependent preprocessing steps in your cross-validation. However, up to this point, we’ve only needed to consider a single preprocessing step. How do we combine more than one? mlr makes this simple. When we wrap together a learner & a preprocessing step, we essentially create a new learner algorithm that includes that preprocessing. So, to include an additional preprocessing step, we simply wrap the wrapped learner. The result is a sort of Matryoshka doll of wrappers, where one is encapsulated by another, which is encapsulated by another, & so on.

Using this strategy, we can combine as many preprocessing steps as we like to create a pipeline. The innermost wrapper will always be used first, then the next innermost, & so on.

We’re going to make an impute wrapper & then pass it as the learner to a feature-selection wrapper. First, we redefine our imputation method using the imputeLearner() function. Then, we create an imputation wrapper using the makeImputeWrapper() function, which takes the learner as the first argument. We use list(numeric = imputeMethod) as the classes argument to apply this imputation strategy to all of our numeric predictors.

Next comes the neat bit: we create a feature-selection wrapper using makeFeatSelWrapper() & supply the imputation wrapper we created as the learner. This is a crucial step because we’re creating a wrapper with another wrapper. We set the cross-validation method as kFold & the method of searching feature combinations as featSelControl.

imputeMethod <- imputeLearner('regr.rpart')
imputeWrapper <- makeImputeWrapper(lin, 
                                   classes = list(numeric = imputeMethod))
featSelWrapper <- makeFeatSelWrapper(learner = imputeWrapper,
                                     resampling = kFold, 
                                     control = featSelControl)
featSelWrapper
## Learner regr.lm.imputed.featsel from package stats
## Type: regr
## Name: ; Short name: 
## Class: FeatSelWrapper
## Properties: numerics,factors,missings,weights,se
## Predict-Type: response
## Hyperparameters:

Now, let’s cross-validate our entire model-building process like good data scientists. We define a task using the ozoneClean tibble, which still contains missing data. Next, we define an ordinary 3-fold cross-validation strategy for the outer loop. Finally, we start parallelisation with parallelStartSocket() & begin cross-validation by supplying the learner (the wrapped wrapper), task, & cross-validation strategy to the resample() function.

ozoneTaskWithNAs <- makeRegrTask(data = ozoneClean, target = 'Ozone')
kFold3 <- makeResampleDesc('CV', iters = 3)

parallelStartSocket(cpus = detectCores() - 1)

lmCV <- resample(featSelWrapper, ozoneTaskWithNAs, resampling = kFold3)

parallelStop()

lmCV
## Resample Result
## Task: ozoneClean
## Learner: regr.lm.imputed.featsel
## Aggr perf: mse.test.mean=21.5519674
## Runtime: 82.6973

The cross-validation process proceeds like this:

  1. Split the data into 3 folds.
  2. For each fold:
    • Use the rpart algorithm to impute the missing values.
    • Perform feature selection:
      • Use a selection method (in our case, floating backward search) to select combinations of features to train models on.
      • Use 10-fold cross-validation to evaluate the performance of each model.
  3. Return the best-performing model for each of the three outer folds.
  4. Return the mean MSE to give us our estimate of performance.

We can see that our model-building process gives us a mean MSE of approximately 21, suggesting a mean residual error of about 4.6 on the original ozone scale (taking the square root of 21).

Interpreting the Model

Due to their simple structure, linear models are usually quite easy to interpret: we can look at the slope for each predictor to infer how much it affects the outcome variable. However, whether these interpretations are justified depends on whether certain model assumptions have been met, so we’ll interpret the model output & then generate some diagnostic plots.

First, we need to extract the model information from our model object using the getLearnerModel() function. By calling summary() on the model data, we get an output with lots of information about our model.

wrapperModelData <- getLearnerModel(wrapperModel)
summary(wrapperModelData)
## 
## Call:
## stats::lm(formula = f, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1285  -3.0253  -0.4689   2.6848  13.7779 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.0320707 26.9086028   1.116 0.265146    
## Month        -0.2606420  0.0753338  -3.460 0.000607 ***
## Press_height -0.0084528  0.0050197  -1.684 0.093080 .  
## Humid         0.0815839  0.0137726   5.924 7.46e-09 ***
## Temp_Sand     0.2167663  0.0430415   5.036 7.58e-07 ***
## Temp_Monte    0.2656316  0.0633190   4.195 3.45e-05 ***
## Inv_height   -0.0005736  0.0001762  -3.255 0.001243 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.471 on 354 degrees of freedom
## Multiple R-squared:  0.6863, Adjusted R-squared:  0.681 
## F-statistic: 129.1 on 6 and 354 DF,  p-value: < 2.2e-16

The Call component would normally tell us the formula we used to create the model (which variables & whether we added more complex relationships between them). Because we built this model using mlr, we unfortunately don’t get that information here, but the model formula is all of the selected predictors combined linearly together.

The Residuals component gives us some summary statistics about the model residuals. Here, we’re looking to see whether the median is approximately 0 & whether the first & third quartiles have roughly the same absolute value. If they don’t, this might suggest that the residuals are either not normally distributed or heteroscedastic. In both situations, not only could this negatively impact model performance, but it could make our interpretation of the slopes incorrect.

The Coefficients component shows us a table of model parameters & their standard errors. The intercept is the estimate of the ozone level when all other variables are 0. In this particular case, it doesn’t make sense for some of our variables to be 0 (month, for example), so we don’t draw too much interpretation from this. The estimates for the predictors are their slopes. For example, our model estimates that for every one-unit increase in the Temp_Sand variable, Ozone increases by about 0.2 (holding all other variables constant). The Pr(>|t|) column contains p values that, in theory, represent the probability of seeing a slope this large if the population slope were actually 0. Use the p values to guide your model-building process, by all means; but there are some problems associated with p values, so don’t put too much faith in them.
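
To see what these slopes mean in practice, we can reconstruct a single prediction by hand from the coefficient table (a sketch using made-up predictor values; coef() works here because wrapperModelData is an ordinary lm object):

```r
coefs <- coef(wrapperModelData)

# Made-up predictor values for one hypothetical day
newCase <- c(Month = 6, Press_height = 5760, Humid = 60,
             Temp_Sand = 70, Temp_Monte = 65, Inv_height = 2000)

# Prediction = intercept + sum of (slope * predictor value)
predOzone <- unname(coefs['(Intercept)'] +
                    sum(coefs[names(newCase)] * newCase))
```

This should give the same result as calling predict(wrapperModelData, as.data.frame(t(newCase))), because a linear model’s prediction is nothing more than this weighted sum.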

Finally, Residual standard error is the same as RMSE, Multiple R-squared is an estimate of the proportion of variance in the data accounted for by our model, & the F-statistic is the ratio of variance explained by our model to the variance not explained by the model. The p value here is an estimate of the probability that our model is better than just using the mean of Ozone to make predictions.
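
Multiple R-squared can likewise be recovered by hand from the residuals: it is one minus the ratio of the residual sum of squares to the total sum of squares of the outcome (a sketch assuming the ozoneSelFeat data frame created earlier):

```r
# Residual sum of squares: variation left unexplained by the model
rss <- sum(residuals(wrapperModelData)^2)

# Total sum of squares: variation of Ozone around its own mean
tss <- sum((ozoneSelFeat$Ozone - mean(ozoneSelFeat$Ozone))^2)

rsq <- 1 - rss / tss  # should match the Multiple R-squared value above
```

A model that predicts no better than the mean of Ozone would have rss equal to tss, giving an R-squared of 0.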

We can very quickly & easily print diagnostic plots for linear models in R by supplying the model data as the argument to plot(). We split the plotting device into four parts using the mfrow argument to the par() function. This means when we create our diagnostic plots (there will be four of them), they will be tiled in the same plotting window. These plots may help us identify flaws in our model that impact predictive performance.

par(mfrow = c(2, 2))
plot(wrapperModelData)

par(mfrow = c(1, 1))